A unified view of TD algorithms, introducing Full-gradient TD and Equi-gradient descent TD

Authors

  • Manuel Loth
  • Philippe Preux
  • Manuel Davy
Abstract

This paper addresses policy evaluation in Markov Decision Processes using linear function approximation. It provides a unified view of algorithms such as TD(λ), LSTD(λ), iLSTD, and residual-gradient TD. It is asserted that they all consist in minimizing a gradient function, and that they differ in the form of this function and in their means of minimizing it. Two new schemes are introduced in that framework: Full-gradient TD, which uses a generalization of the principle introduced in iLSTD, and EGD TD, which reduces the gradient by successive equi-gradient descents. These three algorithms (iLSTD, Full-gradient TD, and EGD TD) form a new intermediate family with the interesting property of making much better use of the samples than TD while keeping a gradient descent scheme, which is useful for complexity issues and optimistic policy iteration.

1 The policy evaluation problem

A Markov Decision Process (MDP) describes a dynamical system and an agent. The system is described by its state s ∈ S. When considering discrete time, the agent can apply at each time step an action u ∈ U, which drives the system to a state s′ = u(s) at the next time step. u is generally non-deterministic. To each transition is associated a reward r ∈ R ⊂ ℝ. A policy π is a function that associates to any state of the system an action taken by the agent. Given a discount factor γ, the value function v of a policy π associates to any state the expected discounted sum of rewards received when applying π from that state for an infinite time:
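The excerpt breaks off before the formula. In standard notation (assumed here, not taken from the paper), the quantity described is v(s) = E[ Σ_{t≥0} γ^t r_t | s_0 = s ], with actions chosen by π. As a minimal sketch of what policy evaluation with a linear model looks like, the following Python code runs plain TD(0) on a hypothetical two-state chain and compares the result with the exact solution of v = r + γPv. The chain, the reward convention (a reward that depends only on the state being left), the one-hot features, and all names and parameter values are illustrative assumptions; this is not the Full-gradient TD or EGD TD scheme of the paper.

```python
import random

GAMMA = 0.5                      # discount factor (illustrative value)
P = [[0.9, 0.1], [0.5, 0.5]]     # P[s][s2]: transition probabilities under the fixed policy
R = [0.0, 1.0]                   # reward received on leaving state s (simplifying assumption)
PHI = [[1.0, 0.0], [0.0, 1.0]]   # feature vectors phi(s); one-hot, so the linear model is exact


def exact_values():
    """Solve (I - GAMMA * P) v = R for the two-state chain by Cramer's rule."""
    a, b = 1 - GAMMA * P[0][0], -GAMMA * P[0][1]
    c, d = -GAMMA * P[1][0], 1 - GAMMA * P[1][1]
    det = a * d - b * c
    return [(R[0] * d - b * R[1]) / det, (a * R[1] - c * R[0]) / det]


def td0(steps=100_000, alpha=0.005, seed=0):
    """TD(0) with a linear value estimate v(s) ~ phi(s) . theta."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    s = 0
    for _ in range(steps):
        # Sample the next state from the chain induced by the policy.
        s2 = 0 if rng.random() < P[s][0] else 1
        v_s = sum(f * w for f, w in zip(PHI[s], theta))
        v_s2 = sum(f * w for f, w in zip(PHI[s2], theta))
        delta = R[s] + GAMMA * v_s2 - v_s   # temporal-difference error
        # Move theta along phi(s), scaled by the TD error.
        theta = [w + alpha * delta * f for w, f in zip(theta, PHI[s])]
        s = s2
    return theta


if __name__ == "__main__":
    print("exact:", exact_values())
    print("TD(0):", td0())
```

With a constant step size, the TD(0) estimate only settles into a neighbourhood of the exact values, and each sample is used once and then discarded. This is the sample inefficiency that the abstract contrasts with the new family of algorithms.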

Similar resources

University of Alberta Gradient Temporal-Difference Learning Algorithms

We present a new family of gradient temporal-difference (TD) learning methods with function approximation whose complexity, both in terms of memory and per-time-step computation, scales linearly with the number of learning parameters. TD methods are powerful prediction techniques, and with function approximation form a core part of modern reinforcement learning (RL). However, the most popular T...

Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(λ), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximat...

Image Restoration with Two-Dimensional Adaptive Filter Algorithms

Two-dimensional (TD) adaptive filtering is a technique that can be applied to many image and signal processing applications. This paper extends one-dimensional adaptive filter algorithms to TD structures and establishes novel TD adaptive filters. Based on this extension, the TD variable step-size normalized least mean squares (TD-VSS-NLMS), the TD-VSS affine projection algorithms (...

Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation

Sutton, Szepesvári and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator. Although their “gradient temporal difference” (GTD) algorithm converges reliably, it can be very slow compared to conventional linear...

On Convergence of some Gradient-based Temporal-Differences Algorithms for Off-Policy Learning

We consider off-policy temporal-difference (TD) learning methods for policy evaluation in Markov decision processes with finite spaces and discounted reward criteria, and we present a collection of convergence results for several gradient-based TD algorithms with linear function approximation. The algorithms we analyze include: (i) two basic forms of two-time-scale gradient-based TD algorithms,...


Publication date: 2007